Introduction

This tutorial introduces lexicography with R and shows how to use R to create dictionaries and to find synonyms by determining semantic similarity. While the initial example focuses on English, subsequent sections show how easily this approach can be generalized to languages other than English (e.g. German, French, Spanish, Italian, or Dutch). The entire R-markdown document for the sections below can be downloaded here.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to it and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below execute without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes), so do not worry if it takes a while.

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install libraries
install.packages(c("tidyr", "tidytext", "quanteda", "koRpus", "DT", "hunspell"))
# install the English language support package for koRpus
koRpus::install.koRpus.lang("en")

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.

1 Creating a basic dictionary for English

In a first step, we load the necessary packages from the library and define the location of the engine which we use for the part-of-speech tagging. In this case, we will use the TreeTagger (see Schmid 1994, 2013; Schmid et al. 2007). How to install and then use the TreeTagger for English as well as for German, French, Spanish, Italian, and Dutch is demonstrated and explained here.


NOTE

You will have to install TreeTagger and change the path used below ("C:\\TreeTagger\\bin\\tag-english.bat") to the location where you have installed TreeTagger on your machine. If you do not know how to install TreeTagger or encounter problems, read this tutorial!

In addition, you can download the pos-tagged text here so you can simply skip the next code chunk and load the data as shown below.


# activate packages
library(tidyr)
library(tidytext)
library(quanteda)
library(koRpus)
library(koRpus.lang.en)
library(DT)
# define location of pos-tagger engine
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en") 

In a next step, we load and process the data, which in this tutorial is the text of George Orwell’s Nineteen Eighty-Four. We will not pre-process the data (for instance by repairing broken or otherwise compromised words) but continue directly with the part-of-speech tagging.

# load and pos-tag data
#orwell_pos <- treetag("https://slcladal.github.io/data/orwell.txt")
orwell_pos <- treetag("data/orwell.txt")
# select data frame
orwell_pos <- orwell_pos@tokens
# inspect  results
datatable(orwell_pos, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

If you could not pos-tag the text, you can simply execute the following code chunk which loads the pos-tagged text from the LADAL repository.

# load pos-tagged data
orwell_pos <- read.delim("https://slcladal.github.io/data/orwell_pos.txt", sep = "\t", header = T)
# inspect  results
datatable(orwell_pos, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

We can now use the resulting table to generate a first, basic dictionary that holds information about the word form (token), the part-of speech tag (tag), the lemmatized word type (lemma), the general word category (wclass), and the frequency with which the word form is used as that part-of speech.

# generate dictionary
orwell_dic_raw <- orwell_pos %>%
  dplyr::select(token, tag, lemma, wclass) %>%
  dplyr::group_by(token, tag, lemma, wclass) %>%
  dplyr::summarise(frequency = dplyr::n()) %>%
  dplyr::arrange(lemma)
# inspect  results
datatable(orwell_dic_raw, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

Cleaning dictionary entries

However, as the resulting table shows, the data is still very noisy: it contains many non-words, i.e. words that are misspelled, broken, or otherwise compromised. To get rid of these, we can simply check whether each lemma exists in an established dictionary. If you instead aim to identify exactly those words that are not yet part of an established dictionary, you can of course do it the other way around and remove all words that are already attested in an existing dictionary.

library(hunspell)
# generate dictionary
orwell_dic_clean <- orwell_dic_raw %>%
  dplyr::filter(hunspell_check(lemma)) %>%
  dplyr::filter(!stringr::str_detect(lemma, "\\W\\w{1,}"))
# inspect  results
head(orwell_dic_clean, 20)
## # A tibble: 20 x 5
## # Groups:   token, tag, lemma [20]
##    token       tag   lemma       wclass      frequency
##    <chr>       <chr> <chr>       <chr>           <int>
##  1 .           SENT  .           fullstop         5607
##  2 ...         :     ...         punctuation         7
##  3 3rd         JJ    3rd         adjective           1
##  4 4th         JJ    4th         adjective           3
##  5 a           DT    a           determiner       2277
##  6 A           DT    a           determiner        110
##  7 A           NP    A           name                3
##  8 aback       RB    aback       adverb              2
##  9 abandon     VV    abandon     verb                3
## 10 abandoned   VVD   abandon     verb                1
## 11 abandoned   VVN   abandon     verb                3
## 12 abashed     VVN   abash       verb                1
## 13 abbreviated VVN   abbreviate  verb                1
## 14 abiding     JJ    abiding     adjective           1
## 15 ability     NN    ability     noun                1
## 16 abject      JJ    abject      adjective           3
## 17 able        JJ    able        adjective          25
## 18 ablest      JJS   able        adjective           1
## 19 abnormality NN    abnormality noun                1
## 20 abolish     VV    abolish     verb                2

We have now checked the entries against an existing dictionary and removed non-word elements. As such, we are left with a clean dictionary based on George Orwell’s Nineteen Eighty-Four.
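Once a dictionary has this tabular form, looking up entries reduces to filtering and aggregating the table. The sketch below uses a small, made-up stand-in table so that it runs on its own; in the tutorial itself, you would apply the same calls to the `orwell_dic_clean` table generated above.

```r
# activate packages
library(dplyr)
# a small stand-in for the cleaned dictionary generated above
dict <- dplyr::tibble(
  token     = c("abandon", "abandoned", "abandoned", "able", "ablest"),
  tag       = c("VV", "VVD", "VVN", "JJ", "JJS"),
  lemma     = c("abandon", "abandon", "abandon", "able", "able"),
  wclass    = c("verb", "verb", "verb", "adjective", "adjective"),
  frequency = c(3, 1, 3, 25, 1)
)
# all word forms recorded for the lemma "abandon"
abandon_entries <- dict %>%
  dplyr::filter(lemma == "abandon")
abandon_entries
# total frequency per lemma and word class
dict %>%
  dplyr::group_by(lemma, wclass) %>%
  dplyr::summarise(frequency = sum(frequency), .groups = "drop")
```

Because the dictionary is an ordinary data frame, any dplyr verb (filtering, joining, aggregating) can be used to query or reshape it.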

2 Extending dictionaries

Extending dictionaries, that is, adding additional layers of information or other types of annotation (e.g. URLs of relevant references or sources), is fortunately very easy in R and can be done without much additional computing. To keep this tutorial simple and straightforward, we will add information about the polarity and emotionality of the words in the dictionary that we have just generated. We can do this by performing a sentiment analysis on the lemmas using the tidytext package.
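As a minimal sketch of such an extension, the code below joins dictionary lemmas with the Bing polarity lexicon that ships with the tidytext package. A small stand-in table is used here so the chunk runs on its own; in the tutorial itself, the same join would be applied to the `orwell_dic_clean` table generated above.

```r
# activate packages
library(dplyr)
library(tidytext)
# a tiny stand-in for the cleaned dictionary generated above
dict <- dplyr::tibble(
  lemma     = c("abandon", "able", "abject", "ability"),
  wclass    = c("verb", "adjective", "adjective", "noun"),
  frequency = c(7, 26, 3, 1)
)
# the Bing lexicon ships with tidytext (columns: word, sentiment)
bing <- tidytext::get_sentiments("bing")
# add polarity by matching lemmas against the lexicon;
# lemmas without a match are labeled "neutral"
dict_extended <- dict %>%
  dplyr::left_join(bing, by = c("lemma" = "word")) %>%
  dplyr::mutate(polarity = ifelse(is.na(sentiment), "neutral", sentiment)) %>%
  dplyr::select(-sentiment)
dict_extended
```

Other tidytext lexicons (e.g. "afinn" or "nrc", which additionally require the textdata package) can be joined in the same way to add emotionality or graded polarity scores as further columns.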

3 Finding synonyms
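One common way to find synonym candidates, sketched here as a toy illustration, rests on the distributional idea mentioned in the introduction: words that occur in similar contexts tend to be semantically similar. The chunk below (corpus and all values are made up for illustration) builds a feature co-occurrence matrix with quanteda and compares the co-occurrence profiles of words via cosine similarity.

```r
# activate packages
library(quanteda)
# toy corpus; each sentence serves as one context
txt <- c("the old man walked the old dog",
         "the elderly man fed the elderly dog",
         "a young child played with a young puppy")
# feature co-occurrence matrix: how often each word occurs in the
# same sentence as each other word
fcm_toy <- fcm(tokens(txt), context = "document")
m <- as.matrix(fcm_toy)
# symmetrize so that each row holds a word's full co-occurrence profile
m <- m + t(m)
# cosine similarity between two co-occurrence profiles
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# "old" and "elderly" share contexts (man, dog), "old" and "young" do not
cosine(m["old", ], m["elderly", ])
cosine(m["old", ], m["young", ])
```

On real data, the same logic is applied to a much larger corpus, and for each word the most similar words by cosine score are extracted as synonym candidates.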

4 Going further: crowd-sourced dictionaries with R and Git

We have reached the end of this tutorial and you now know how to create, clean, and extend dictionaries in R.

Citation & Session Info

Schweinberger, Martin. 2020. Lexicography with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/lex.html (Version 2020.09.28).

@manual{schweinberger2020lex,
  author = {Schweinberger, Martin},
  title = {Lexicography with R},
  note = {https://slcladal.github.io/lex.html},
  year = {2020},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/09/28}
}
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hunspell_3.0         DT_0.15              koRpus.lang.en_0.1-3
## [4] koRpus_0.13-2        sylly_0.1-6          quanteda_2.1.1      
## [7] tidytext_0.2.6       tidyr_1.1.2         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         pillar_1.4.6       compiler_4.0.2     stopwords_2.0     
##  [5] tokenizers_0.2.1   tools_4.0.2        digest_0.6.25      jsonlite_1.7.1    
##  [9] evaluate_0.14      lifecycle_0.2.0    tibble_3.0.3       gtable_0.3.0      
## [13] lattice_0.20-41    pkgconfig_2.0.3    rlang_0.4.7        fastmatch_1.1-0   
## [17] Matrix_1.2-18      cli_2.0.2          crosstalk_1.1.0.1  yaml_2.2.1        
## [21] sylly.en_0.1-3     xfun_0.16          janeaustenr_0.1.5  dplyr_1.0.2       
## [25] stringr_1.4.0      knitr_1.30         htmlwidgets_1.5.1  fs_1.5.0          
## [29] generics_0.0.2     vctrs_0.3.4        grid_4.0.2         tidyselect_1.1.0  
## [33] glue_1.4.2         data.table_1.13.0  R6_2.4.1           fansi_0.4.1       
## [37] rmarkdown_2.3      purrr_0.3.4        ggplot2_3.3.2      magrittr_1.5      
## [41] usethis_1.6.3      scales_1.1.1       SnowballC_0.7.0    ellipsis_0.3.1    
## [45] htmltools_0.5.0    assertthat_0.2.1   colorspace_1.4-1   utf8_1.1.4        
## [49] stringi_1.5.3      RcppParallel_5.0.2 munsell_0.5.0      crayon_1.3.4

References


Schmid, Helmut. 1994. “TreeTagger - a Language Independent Part-of-Speech Tagger.” http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.

———. 2013. “Probabilistic Part-of-Speech Tagging Using Decision Trees.” In New Methods in Language Processing, 154.

Schmid, Helmut, M Baroni, E Zanchetta, and A Stein. 2007. “The Enriched Treetagger System.” In Proceedings of the Evalita 2007 Workshop.